The general aim is to create a machine learning model that will identify and extract accession numbers relating to biological entities, notably DNA and protein sequences and samples, from patent texts. The main focus is on identifying and extracting accession numbers from the International Nucleotide Sequence Database Collaboration (INSDC), consisting of the European Nucleotide Archive (ENA), the National Center for Biotechnology Information (NCBI) and the DNA Data Bank of Japan (DDBJ). This involves capturing a range of organisation names associated with accession numbers in texts as well as patterns for accession numbers that are typically numeric (138973) or alphanumeric (NM138973) in form.
A secondary aim is to explore whether it is possible to link
accession numbers referenced in the description section of patent texts
(the main body of a patent application) with references to sequences
appearing in the patent claims. Typically, patent applicants make
reference to sequences using the accession number and the expression
SEQ ID NO: x in the description. However, in the claims
section of an application only the SEQ ID will appear. This makes it
difficult to interpret exactly what sequence is involved in the patent
claims and the role of a sequence in the claims (e.g. as reference
material or as material to or part of the claimed invention). It is
therefore desirable to explore whether the connection between the
sequence identifiers in the claims and accession numbers can be
established in an automated way.
Natural Language Processing (NLP) includes named entity recognition, the extraction of named entities such as persons, organisations and geographic place names from texts. Over the last ten years traditional text mining techniques have increasingly been replaced by, or combined with, machine learning models, notably deep learning models based on convolutional neural networks (CNNs) using word vectors (e.g. word2vec) for use on CPU, and the state of the art BERT (Bidirectional Encoder Representations from Transformers) transformer models designed for use on GPU. A popular machine learning library for NLP in production is the spaCy Python library, which allows for the use of either vector or transformer models. In contrast with other popular libraries such as NLTK or scikit-learn, spaCy is directed to use in production: project templates can be used to train a model and make it immediately available as a Python (pip) installable package. While the state of the art in named entity recognition is represented by transformer models on GPU, for the present task we preferred to use the more readily accessible CPU based spaCy medium English model with word vectors as the basis for the accession recognition tasks identified above. This means that the output model can be readily used in production and installed by users as a standard Python package on a laptop.
NLP models are typically pre-trained on a large corpus of texts and are able to make predictions about entity types for a set of standard labels ‘out of the box’ (e.g. DATE, CARDINAL, QUANTITY, PERSON, ORG, GPE for geo political entities such as countries, LOC for locations such as mountain ranges etc). However, it is normally necessary to improve the predictions made by a pre-trained model for texts from a specific domain (e.g. scientific or patent texts) and to add additional labels. In the context of the present research we will focus on improving capture for the pre-trained ORG label for organisations and add the ACCESSION and SEQ labels.
Improvements and additions to the labels in pre-trained models require the availability of annotated texts in the domain of interest that contain the desired labels (ORG, ACCESSION, SEQ). A machine learning model is trained using the annotated texts, with one portion used for training (typically 70%) and the remainder held back for evaluation or testing of model performance (typically 30%). The evaluation or test set is not seen by the model during training. This allows the trained model to be evaluated against a set of data where the answer is already known. Evaluation involves a range of statistics, notably the loss, precision and recall, with the F-score (the harmonic mean of precision and recall) used to assess the balance of performance across the two.
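As a minimal sketch of the evaluation logic described above, the following Python holds out 30% of a set of examples and computes precision, recall and the F-score for one text. The function names and the entity spans are hypothetical and chosen only for illustration; spaCy computes the equivalent scores itself during training.

```python
import random


def split_train_eval(examples, eval_fraction=0.3, seed=0):
    """Shuffle the examples and hold out a fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


def precision_recall_f1(gold, predicted):
    """Score predicted entity spans against gold spans (both are sets)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # The F-score is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (start, end, label) spans for a single evaluation text.
gold = {(0, 4, "ORG"), (37, 43, "ACCESSION"), (44, 50, "ACCESSION")}
predicted = {(0, 4, "ORG"), (37, 43, "ACCESSION"), (60, 64, "ACCESSION")}

train, dev = split_train_eval(range(10))
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(len(train), len(dev))
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Two of the three predicted spans are correct and two of the three gold spans are found, so precision, recall and the F-score all come out at two thirds in this toy case.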
The most important challenge in NLP tasks is generating sufficient annotations in the domain to train a model to perform to a useful standard (e.g. capturing 80% of target entities or above, depending on the task). In the present research we used a table of short texts containing references in USPTO granted patents (48.6 million rows) that is made available through the USPTO PatentsView data service. Annotations were manually created and adjusted using the Prodigy annotation tool from the makers of spaCy (explosion.ai). This allows for tight integration between the creation of annotations and the training of models.
Annotation and training is an iterative process involving multiple steps that is directed towards creating a good general model for the specific task. That is, a model that performs well in precision and recall of the desired entities against previously unseen texts in the domain of interest. It is important to emphasise that the aim is not to create a model that performs perfectly on the training data as this will result in over-fitting (a model that is perfect on training data but performs poorly on unseen data).
Producing annotations for new labels involves overcoming the
cold start problem (a lack of sufficient data for a label
to train a model). To overcome this problem we searched the USPTO
references table for all mentions of the word accession, or the
organisation names and their full name variants and acronyms (EMBL, EBI,
ENA, NCBI, DDBJ etc) to produce a raw set of 841,728 example texts from
14,167 granted US patents (see
assets/accession_refs_id.csv). In addition, using a
separate dataset of worldwide patent data that is known to contain
sequences from the Lens patent database we identified a set of 348,076
texts containing the same search terms
(assets/accession_sent_nchar2000.csv) from xxx documents [note: could
usefully exclude shared docs or exact match texts]. The aim of these
datasets is to use samples of the texts for annotation and testing of
models.
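The keyword search used to overcome the cold start problem can be sketched in a few lines of Python. This is an illustration only and not the query actually run against the USPTO references table: the term list is abbreviated, the function name is hypothetical and the sample texts are invented.

```python
import re

# Abbreviated list of search terms from the text. "accession" is matched with
# either initial case; the acronyms are matched in upper case only.
TERM_PATTERN = re.compile(r"\b(?:[Aa]ccession|EMBL|EBI|ENA|NCBI|DDBJ)\b")


def select_candidates(texts):
    """Keep only the texts that mention at least one search term."""
    return [text for text in texts if TERM_PATTERN.search(text)]


# Invented example texts.
samples = [
    "GenBank accession No. X02996",
    "Sequence retrieved from NCBI",
    "A semiconductor chip assembly",
    "Arena lighting system",
]

for text in select_candidates(samples):
    print(text)
```

The word boundaries (`\b`) are a design choice to stop short acronyms such as ENA matching inside longer words, which would otherwise add noise to the raw set.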
An important strength of spaCy and the Prodigy annotation tool is that a machine learning model can be combined with a dictionary or thesaurus of labelled terms (such as ORG for ENA or European Nucleotide Archive). In the language of spaCy this is called an ‘entity_ruler’, and its patterns are written in JSON Lines (jsonl) format where each line is a valid JSON object.
{"label":"ORG","pattern":[{"lower":"ncbi"}], "id": "ncbi"}
{"label":"ORG","pattern":[{"lower":"national"},{"lower":"center"},{"lower":"for"},{"lower":"biotechnology"},{"lower":"information"}], "id": "ncbi"}
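Patterns of this kind can be generated rather than typed by hand. The sketch below builds one JSON object per line for a list of name variants that share an entity id. The helper `org_patterns` is hypothetical, and splitting names on white space is a simplifying assumption: the spaCy tokenizer may split names containing punctuation or hyphens differently, so generated patterns should be checked against real matches.

```python
import json


def org_patterns(entity_id, names):
    """Build one lower-case token pattern per name variant."""
    patterns = []
    for name in names:
        # Assumption: one token per white-space separated word.
        tokens = [{"lower": word.lower()} for word in name.split()]
        patterns.append({"label": "ORG", "pattern": tokens, "id": entity_id})
    return patterns


variants = ["NCBI", "National Center for Biotechnology Information"]
lines = [json.dumps(pattern) for pattern in org_patterns("ncbi", variants)]

# JSON Lines: one valid JSON object per line.
print("\n".join(lines))
```

Giving every variant the same `id` is what later allows counts for the different surface forms to be aggregated under one organisation.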
In spaCy and Prodigy it is possible to write patterns that either
explicitly capture a term as is (known as ORTH) or, as in the example
above, use lower to capture both upper case and lower case
versions of a string. In addition, it is also possible to use regular
expressions and this proved particularly important for capturing
accession numbers. We used regular expressions developed by the European
Nucleotide Archive¹ with additions to reflect the use of white
space in identifiers in patent texts.
{"label": "ACCESSION", "pattern": [{"text":{"REGEX": "PRJ(E|D|N)[A-Z][0-9]+"}}]}
{"label": "ACCESSION", "pattern": [{"text":{"REGEX": "(E|D|S)RP[0-9]{6,}"}}]}
{"label": "ACCESSION", "pattern": [{"text":{"REGEX": "SAM(E|D|N)[A-Z]?[0-9]+"}}]}
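It can help to check regular expressions like these against a few strings before adding them to the entity ruler. The sketch below uses Python's `re` module directly, outside spaCy. The dictionary keys are descriptive names added for illustration, the test strings are invented, and the use of `fullmatch` (the whole token must match) is an assumption made here for strictness rather than a statement about how spaCy applies the REGEX operator.

```python
import re

# The three expressions shown above, keyed by an illustrative description.
ACCESSION_REGEXES = {
    "project": re.compile(r"PRJ(E|D|N)[A-Z][0-9]+"),
    "study": re.compile(r"(E|D|S)RP[0-9]{6,}"),
    "sample": re.compile(r"SAM(E|D|N)[A-Z]?[0-9]+"),
}


def classify(token):
    """Return the name of the first expression matching the whole token."""
    for name, regex in ACCESSION_REGEXES.items():
        if regex.fullmatch(token):
            return name
    return None


# Invented identifiers that follow the patterns, plus two non-matches.
for token in ["PRJEB12345", "SRP123456", "SAMN01234567", "PCT", "SRP123"]:
    print(token, classify(token))
```

The last two strings are rejected: PCT fits none of the prefixes and SRP123 has fewer than the six digits the study pattern requires.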
In Step 1 a set of 957 annotations was developed in Prodigy using the English medium model, with the patterns used to pre-highlight matches. These annotations were exported from Prodigy into spaCy format as a corpus split into training (70%) and evaluation (30%) sets.
# transfer the annotations to spaCy format, splitting into train and eval (writes train.spacy and dev.spacy to corpus/)
prodigy data-to-spacy corpus --ner accession_usptorefs,accession_sent_nchar2000rerun --eval-split 0.3
# train using the exported config, with train.spacy for training and dev.spacy for evaluation
python -m spacy train corpus/config.cfg --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy
For the training a spaCy project was created from an existing template containing commands for training, adding an entity ruler and packaging a model as a pip installable.
spacy project run train
The performance of the model over ten iterations is displayed in Figure @ref(fig:step1) below. The scores of interest are recall (ENTS_R), precision (ENTS_P) and the F-score (ENTS_F), the harmonic mean of precision and recall, which is used to evaluate the overall performance of the model.
Evaluating a starter model
The F-score of 90.84 suggests that we have a good model. However, note that the training and evaluation sets are relatively small compared with the scale of real world data. Nevertheless, this illustrates that NLP models can be trained with a relatively small number of examples as a starter set and produce meaningful results.
Evaluation of the performance of the starter model was carried out by processing the full 817,000 texts and inspecting the results for valid and noisy results. In the table below we display the totals for the organisation label and the accession label.
The Organisation (ORG) label
library(tidyverse)
accession_ref_all_id_results <- read_csv("results/accession_ref_all_id_results.csv")
accession_ref_all_id_results %>%
filter(entity_label == "ORG") %>%
count(entity_id, entity_text, sort = TRUE) %>%
head(., 20) %>%
knitr::kable()
| entity_id | entity_text | n |
|---|---|---|
| ncbi | GenBank | 21376 |
| ncbi | NCBI | 11496 |
| embl | EBI | 7545 |
| embl | EMBL | 5784 |
| ncbi | Genbank | 5580 |
| ncbi | National Center for Biotechnology Information | 4905 |
| NA | American Chemical Society | 3417 |
| uniprot | UniProt | 2692 |
| NA | Journal of Biological Chemistry | 2204 |
| NA | National Institutes of Health | 2112 |
| NA | ACM | 1884 |
| uniprot | UNIPROT | 1626 |
| NA | American Physiological Society | 1050 |
| NA | American Heart Association | 937 |
| atcc | ATCC | 716 |
| uniprot | Uniprot | 708 |
| NA | American Society of Nephrology | 498 |
| NA | American Heart Association, Inc. | 446 |
| NA | GNPD | 443 |
| NA | American Institute of Physics | 425 |
In the case of the organisation label we can observe that there are multiple forms of the database name GenBank, which we class as an organisation and assign to the entity id ncbi in the entity_ruler to allow for aggregation. We also observe that the ORG label is capturing names such as ‘Journal of Biological Chemistry’ that are not organisations. This and other examples demonstrate the challenge for a model in discriminating between types of entities that are typically composed of noun phrases.
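The aggregation that the entity id makes possible can be illustrated with the rows of the table above that carry an id. The sketch below simply sums the counts of the different surface forms per id. The counts are copied from the table; the variable names are illustrative.

```python
from collections import Counter

# (entity_id, entity_text, n) for the rows in the table that have an entity id.
rows = [
    ("ncbi", "GenBank", 21376),
    ("ncbi", "NCBI", 11496),
    ("embl", "EBI", 7545),
    ("embl", "EMBL", 5784),
    ("ncbi", "Genbank", 5580),
    ("ncbi", "National Center for Biotechnology Information", 4905),
    ("uniprot", "UniProt", 2692),
    ("uniprot", "UNIPROT", 1626),
    ("atcc", "ATCC", 716),
    ("uniprot", "Uniprot", 708),
]

totals = Counter()
for entity_id, _text, n in rows:
    totals[entity_id] += n

for entity_id, n in totals.most_common():
    print(entity_id, n)
```

On these rows the four name variants assigned to ncbi sum to 43,357 mentions, the two for embl to 13,329 and the three for uniprot to 5,026.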
Our second label for ACCESSION identifiers also reveals issues involving noise.
accession_ref_all_id_results %>%
filter(entity_label == "ACCESSION") %>%
count(entity_id, entity_text, sort = TRUE) %>%
head(., 30) %>%
knitr::kable()
| entity_id | entity_text | n |
|---|---|---|
| NA | PCT | 497 |
| NA | 03C 3137 | 393 |
| NA | AB 2011 | 349 |
| NA | 4 | 250 |
| NA | LG Optimus | 249 |
| NA | Angewandte | 239 |
| NA | BACE1 | 169 |
| NA | ACM | 163 |
| NA | www.clinicaltrials.gov/ct2/show/NCT01390831 | 154 |
| NA | Angove | 153 |
| NA | R1524 | 153 |
| NA | Dermatol | 151 |
| NA | Medtronic | 131 |
| NA | R1517 | 129 |
| NA | P15647 | 124 |
| NA | J05199 | 122 |
| NA | 100300 | 117 |
| NA | R1599 | 115 |
| NA | EMC Corporation | 113 |
| NA | 4th | 108 |
| NA | ASME | 108 |
| NA | LG KE850 | 108 |
| NA | 5th | 107 |
| NA | 11β | 100 |
| NA | 11669 | 99 |
| NA | 7356 | 98 |
| NA | 8220;Semiconductor Chip Assembly | 98 |
| NA | Samsung F700 | 98 |
| NA | AAB10482 | 94 |
| NA | F1209 | 92 |
In this case we have a mix of organisation and product names, accession-like strings, plain numbers, a URL and a small number of genuine accession numbers (for example J05199 and AAB10482).
As this exposes, it is one thing to achieve good scores when training a model on a small set of annotations, and another for that model to perform in the same way on real world unseen data. Nevertheless, while the noise level was high, there were also many thousands of valid hits that were generated by this approach as suggested in the table below.
accession_ref_all_id_results %>%
head(20) %>%
knitr::kable()
| patent_id | id | entity_id | entity_text | entity_label | entity_start | entity_end | entity_id_no | country_code |
|---|---|---|---|---|---|---|---|---|
| USD581385 | D581385 | NA | Bird | ORG | 0 | 4 | 0.000000e+00 | US |
| US7491532 | 7491532 | ncbi | NCBI | ORG | 0 | 4 | 4.160624e+18 | US |
| US7491532 | 7491532 | NA | X02996 | ACCESSION | 37 | 43 | 0.000000e+00 | US |
| US7491532 | 7491532 | NA | J01967 | ACCESSION | 44 | 50 | 0.000000e+00 | US |
| US7491532 | 7491532 | NA | J01968 | ACCESSION | 51 | 57 | 0.000000e+00 | US |
| US7491532 | 7491532 | NA | J01970 | ACCESSION | 58 | 64 | 0.000000e+00 | US |
| US7491532 | 7491532 | NA | J01971 | ACCESSION | 65 | 71 | 0.000000e+00 | US |
| US7491532 | 7491532 | NA | J01972 | ACCESSION | 72 | 78 | 0.000000e+00 | US |
| US7491532 | 7491532 | NA | J01974 | ACCESSION | 79 | 85 | 0.000000e+00 | US |
| US7491532 | 7491532 | NA | J01976 | ACCESSION | 86 | 92 | 0.000000e+00 | US |
| US7491532 | 7491532 | NA | J01977 | ACCESSION | 93 | 99 | 0.000000e+00 | US |
| US7491532 | 7491532 | NA | J01978 | ACCESSION | 100 | 106 | 0.000000e+00 | US |
| US7491532 | 7491532 | NA | J01979 | ACCESSION | 107 | 113 | 0.000000e+00 | US |
| US7491532 | 7491532 | NA | K00515 | ACCESSION | 114 | 120 | 0.000000e+00 | US |
| US7491532 | 7491532 | NA | V00025 | ACCESSION | 121 | 127 | 0.000000e+00 | US |
| US7491532 | 7491532 | NA | V00026 | ACCESSION | 128 | 134 | 0.000000e+00 | US |
| US7491532 | 7491532 | NA | V00027 | ACCESSION | 135 | 141 | 0.000000e+00 | US |
| US7491532 | 7491532 | NA | V00029 | ACCESSION | 142 | 148 | 0.000000e+00 | US |
| US8672839 | 8672839 | NA | Surgical Port Assembly | ORG | 55 | 77 | 0.000000e+00 | US |
| US8048420 | 8048420 | embl | EBI | ORG | 99 | 102 | 1.722845e+18 | US |
In the second step we focused on correcting the annotations, adding the SEQ label and adjusting and improving the entity ruler. This is performed using the ner.correct recipe in Prodigy using the model built in step 1 with the entity ruler added.
The SEQ label was added at this stage because Step 1 annotations revealed that accession numbers are commonly located in close proximity to SEQ ID numbers (that is, in the same sentence or paragraph).
# note that the names of the two datasets got mixed up here and require correction
prodigy ner.correct accession_refs_correct en_accession_ruler ./accession_sent_nchar2000.csv --label ORG,ACCESSION,SEQ -U
prodigy ner.correct accession_nchar_correct en_accession_ruler ./accession_refs.csv --label ORG,ACCESSION,SEQ -U
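The proximity of accession numbers and SEQ ID numbers noted above can be explored with a simple co-occurrence check before any model is involved. The following is a rough sketch under stated assumptions and not the method used in this project: the accession expression covers only two simplified nucleotide formats (one letter plus five digits, or two letters plus six digits), the function name is hypothetical and the example sentences are invented.

```python
import re

# Simplified assumption: one letter + five digits or two letters + six digits.
ACCESSION = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b")
# Captures the first number that follows "SEQ ID NO" or "SEQ ID NOS".
SEQ_ID = re.compile(r"\bSEQ ID NOS?[:.]?\s*(\d+)", re.IGNORECASE)


def cooccurrences(sentence):
    """Pair every accession number with every SEQ ID in the same sentence."""
    accessions = ACCESSION.findall(sentence)
    seq_ids = SEQ_ID.findall(sentence)
    return [(accession, seq_id) for accession in accessions for seq_id in seq_ids]


# Invented example sentences.
linked = "The sequence of GenBank accession number X02996 is set out in SEQ ID NO: 1."
unlinked = "SEQ ID NO: 2 encodes a kinase."

print(cooccurrences(linked))
print(cooccurrences(unlinked))
```

Pairs produced this way are only candidates: a sentence that mentions several accession numbers and several SEQ IDs yields every combination, so the output would still need review.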
The corrected annotation sets are then exported to spaCy format as the new corpus with the 70% and 30% split.
prodigy data-to-spacy corpus --ner accession_refs_correct,accession_nchar_correct --eval-split 0.3
In the next step a new model is trained and packaged with the adjusted entity_ruler. Note here that a total of 1436 annotations were made. spaCy automatically compacted these to 1356 with Training: 953 | Evaluation: 403. We can now also produce some more detailed evaluation measures on the performance of the model for each label.
We can see in the results that overall the model has an F-Score of 92.04 against the 403 unseen records in the test set. We can also see the per label scores under NER. The top performing label is ORG followed by ACCESSION and SEQ.
Note that recall for the ACCESSION label, the main target of this exercise, is lower than we would like, and that the SEQ label is lagging behind the other two.
These results are to be expected for three reasons. First, the pre-trained model already includes the ORG label. We should therefore expect improvements to the ORG label scores. Second, the ACCESSION label is new and the lower recall scores perhaps reflect some of the difficulties involved in distinguishing one set of alphanumeric or purely numeric strings from others.
Third, the difficulties experienced by the SEQ label probably reflect the
variations from the SEQ ID NO: 1 pattern to
SEQ ID NO: 1, 2, 3, 4 or 5 and expose a limitation of token
based models in cases where an entity may span multiple tokens. That is,
where an entity is captured in one or two tokens a model will perform
well. Where an entity spans multiple tokens, with significant variation
in the number of tokens, a model will struggle. This is a recognised
issue and spaCy now includes a span categorizer to assist with
addressing entities with varying span lengths.
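To make the variation in the SEQ ID pattern concrete, the sketch below uses a regular expression, rather than the trained model, to expand a reference such as SEQ ID NO: 1, 2, 3, 4 or 5 into its individual numbers. It is an illustration under stated assumptions: the expression and the function name `seq_ids` are hypothetical, and only lists joined by commas, "or" and "and" are handled. Ranges such as 1-5 are not.

```python
import re

# "SEQ ID NO" or "SEQ ID NOS", an optional colon or full stop, then a number
# optionally followed by further numbers joined by a comma, "or" or "and".
SEQ_LIST = re.compile(
    r"SEQ\s+ID\s+NOS?\s*[:.]?\s*(\d+(?:\s*(?:,|or|and)\s*\d+)*)",
    re.IGNORECASE,
)


def seq_ids(text):
    """Return every sequence number referenced in the text, in order."""
    ids = []
    for match in SEQ_LIST.finditer(text):
        ids.extend(int(number) for number in re.findall(r"\d+", match.group(1)))
    return ids


print(seq_ids("SEQ ID NO: 1"))
print(seq_ids("SEQ ID NO: 1, 2, 3, 4 or 5"))
print(seq_ids("SEQ ID NO: 1 and SEQ ID NO: 2"))
```

The first reference is a handful of tokens while the second runs to many more, which is the variation in span length that makes the SEQ label harder for a token based model.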
This code runs an R function, get_entities(), that calls
a Python script in inst/label_script.py and writes the
results to a csv file. Note that you must have a Python environment
containing the model attached to your RStudio environment (set the
path in Project Options or Global Options).
large <- get_entities(path = "assets/accession_ref_all_id.csv", model = "en_accession_ruler", dest = "results/accession_ref_all_id_results.csv")
Read the dataset back in and count up the results for the labels.
#large <- read_csv("results/accession_ref_all_id_results.csv")
read_csv("data/large_count.csv") %>%
knitr::kable()
| entity_label | n |
|---|---|
| ACCESSION | 91382 |
| ORG | 125245 |
| SEQ | 1468 |
Inspect the ORG label.
read_csv("data/large_org.csv") %>%
head(., 30) %>%
knitr::kable()
## Rows: 8858 Columns: 2
| entity_text | n |
|---|---|
| GenBank | 21376 |
| NCBI | 11496 |
| EBI | 7545 |
| EMBL | 5784 |
| Genbank | 5580 |
| National Center for Biotechnology Information | 4905 |
| American Chemical Society | 3417 |
| UniProt | 2692 |
| Journal of Biological Chemistry | 2204 |
| National Institutes of Health | 2112 |
| ACM | 1884 |
| UNIPROT | 1626 |
| American Physiological Society | 1050 |
| American Heart Association | 937 |
| ATCC | 716 |
| Uniprot | 708 |
| American Society of Nephrology | 498 |
| American Heart Association, Inc. | 446 |
| GNPD | 443 |
| American Institute of Physics | 425 |
| Semiconductor Chip Assembly | 407 |
| Swiss-Prot | 399 |
| American Journal of Physiology | 381 |
| IEEE Computer Society | 338 |
| ACS Nano | 336 |
| American Journal of Kidney Diseases | 329 |
| Senate | 287 |
| USENIX | 285 |
| RTC Industries, Inc. | 276 |
| American Society for Microbiology | 268 |
Inspect the ACCESSION label.
read_csv("data/large_accesion.csv") %>%
head(., 30) %>%
knitr::kable()
## Rows: 31435 Columns: 2
| entity_text | n |
|---|---|
| PCT | 497 |
| 03C 3137 | 393 |
| AB 2011 | 349 |
| 4 | 250 |
| LG Optimus | 249 |
| Angewandte | 239 |
| BACE1 | 169 |
| ACM | 163 |
| www.clinicaltrials.gov/ct2/show/NCT01390831 | 154 |
| Angove | 153 |
| R1524 | 153 |
| Dermatol | 151 |
| Medtronic | 131 |
| R1517 | 129 |
| P15647 | 124 |
| J05199 | 122 |
| 100300 | 117 |
| R1599 | 115 |
| EMC Corporation | 113 |
| 4th | 108 |
| ASME | 108 |
| LG KE850 | 108 |
| 5th | 107 |
| 11β | 100 |
| 11669 | 99 |
| 7356 | 98 |
| 8220;Semiconductor Chip Assembly | 98 |
| Samsung F700 | 98 |
| AAB10482 | 94 |
| F1209 | 92 |
Inspect the SEQ label.
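# filter the full results file for SEQ (the original chunk re-read the ACCESSION counts)
read_csv("results/accession_ref_all_id_results.csv") %>%
filter(entity_label == "SEQ") %>%
count(entity_text, sort = TRUE) %>%
head(., 30) %>%
knitr::kable()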